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Abstract.  We  present  a  very  efficient  semi-supervised  graph-based  algorithm  for  classification  of 
high-dimensional  data  that  is  motivated  by  the  MBO  method  of  Garcia-Cardona  (2014)  and  derived 
using  the  similarity  graph.  Our  procedure  is  an  elegant  combination  of  heat  kernel  pagerank  and 
the  MBO  method  applied  to  study  semi-supervised  problems.  The  timing  of  our  algorithm  is  highly 
dependent  on  how  quickly  the  pagerank  can  be  computed;  we  use  two  different  yet  very  efficient 
approaches  to  calculate  the  pagerank,  one  of  which  proceeds  by  simulating  random  walks  of  bounded 
length.  Overall,  our  method  is  advantageous  for  very  big,  sparse  data,  in  which  the  graph  has  few 
edges,  and  it  produces  good  accuracy  even  if  the  number  of  labeled  instances  is  very  small.  In  fact, 
the  accuracy  of  the  procedure  is  comparable  with  or  better  than  that  of  state-of-the-art  methods  and 
is  demonstrated  on  benchmark  data  sets.  In  addition  to  experimental  results,  we  include  a  thorough 
comparison  of  our  algorithm  to  that  of  Garcia-Cardona  (2014)  and  describe  the  advantages  of  both 
methods. 

Key  words,  heat  kernel  pagerank,  graphs,  graph  Laplacian,  threshold  dynamics,  random  walk 

AMS  subject  classifications. 


1.  Introduction 

Classification  is  an  important  problem  in  machine  learning  and  computer  vision. 
The  goal  is  to  partition  a  data  set  efficiently  and  accurately  into  a  number  of  clusters. 
This  task  occurs  in  numerous  applications,  such  as  medical  or  genomics  data,  spam 
detection,  face  recognition,  financial  predictions,  etc.  and  is  finding  growing  use  in 
text  mining  studies.  In  this  paper,  we  present  an  efficient  algorithm  for  classification 
of  high-dimensional  data  formulated  in  a  graphical  framework  and  motivated  by  [31]. 
The  authors  of  [31]  introduce  a  multi-class  MBO  scheme  on  graphs,  building  upon  the 
graph  Ginzburg-Landau  functional  that  is  provably  close  to  the  graph  cut  measure  of 
a  partition  [4],  and  equivalent  to  minimizing  the  graph  total  variation  functional  of 
the  classification  function.  Total  variation  has  had  a  major  impact  in  low-dimensional 
imaging  processing  in  Euclidean  space  [10-12,  60]  and  its  graphical  cousin  has  been 
studied  in  machine  learning  and  network  science  through  quantities  such  as  the  Cheeger 
cut,  Ratio  cut,  and  modularity  optimization  [40,52]. 

We  approach  the  classification  problem  via  the  use  of  heat  kernel  pagerank,  one  of 
the  many  versions  of  pagerank  that  exist  [34].  The  original  pagerank  algorithm  [58], 
developed  in  the  1990’s,  was  designed  by  Google  to  rank  the  importance  of  websites  in 
search  engine  results.  However,  partly  due  to  its  generality  and  simplicity,  it  was  later 
applied  to  many  other  tasks  in  a  variety  of  fields,  such  as  biology  and  chemistry  [34]. 
Overall,  two  particular  uses  comprise  the  majority  of  the  applications  of  pagerank.  First, 
pagerank  measures  the  “significance”  of  each  node  in  relation  to  the  whole  graph  (i.e. 
in  [45]);  in  other  words,  pagerank  produces  a  ranking  of  the  vertices.  Second,  it  is  often 
used  for  local  graph  clustering  to  find  a  cluster  around  a  particular  point  (i.e.  in  [21]). 

While  most  of  its  uses  consist  of  ranking  and  local  clustering,  personalized  pagerank, 
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which  is  another  version  of  pagerank,  has  been  applied  in  a  few  papers  to  both  unsuper¬ 
vised  and  semi-supervised  learning.  For  example,  in  [66],  the  authors  present  a  top-down 
algorithm  based  on  personalized  pagerank  that  combines  random  walks  and  modularity 
to  accurately  cluster  the  graph  into  two  parts.  In  addition,  in  [23],  Chung  et  al.  for¬ 
mulate  a  procedure  that  uses  personalized  pagerank  vectors  and  an  optimized  jumping 
parameter  to  obtain  a  set  of  clusters.  An  example  of  an  application  of  the  pagerank 
to  semi-supervised  learning  is  presented  in  [70],  where  a  closed-form  expression  for  the 
class  of  each  node  is  derived.  Moreover,  the  authors  of  [50]  describe  a  semi-supervised 
method  for  classifying  data  using  cleverly-designed  random  walks.  While  these  algo¬ 
rithms  produre  good  results,  [66]  is  only  a  binary  clustering  procedure,  [23]  requires 
the  computation  of  a  different  pagerank  for  every  node  and  [70]  involves  solving  a  very 
large  matrix  system.  We  now  present  a  simple,  efficient  and  accurate  algorithm  for 
semi-supervised  learning  based  on  heat  kernel  pagerank. 

When  formulating  the  new  method,  we  use  heat  kernel  pagerank,  which  provides 
several  advantages  and,  to  the  best  of  our  knowledge,  has  never  been  applied  to  the 
semi-supervised  or  unsupervised  machine  learning  task.  One  advantage  of  using  the 
pagerank  is  the  fact  that  it  can  be  computed  very  efficiently,  i.e.  by  the  procedure 
described  in  [21,62],  which  consists  of  simulating  random  walks  of  bounded  length. 
Second,  as  experiments  of  [43]  indicate,  heat  kernel  pagerank  is  able  to  capture  the 
properties  of  real-world  communities  better  than  personalized  pagerank,  and  it  is  more 
accurate  at  detecting  communities.  Third,  heat  kernel  pagerank  represents  an  expo¬ 
nential  sum,  which  converges  more  quickly  in  most  cases  than  the  geometric  sum  of 
personalized  pagerank.  In  addition,  using  the  pagerank  eliminates  the  need  to  compute 
the  eigenvalues  and  eigenvectors  of  the  graph  Laplacian,  which  are  calculations  that 
may  be  computationally  expensive,  especially  in  the  case  of  very  large  graphs.  These 
calculations  are  present  in  [31].  However,  the  method  in  [31]  does  have  its  own  advan¬ 
tages,  and  a  comparison  of  this  algorithm  and  the  one  proposed  here  is  given  in  Section 
5.  Overall,  our  approach  is  particularly  useful  in  the  case  of  large,  sparse  graphs. 

Using  heat  kernel  pagerank  involves  embedding  the  data  in  a  graphical  setting, 
which  provides  several  advantages.  First,  it  allows  procedures  to  exploit  non-local  in¬ 
formation  and  take  into  account  the  relationship  between  any  two  nodes  in  the  data 
set.  In  the  case  of  image  processing,  this  makes  it  easier  for  one  to  capture  repetitive 
structure  and  texture,  as  shown  in  [54].  The  graphical  setting  is  also  very  general,  and 
can  be  easily  applied  to  any  data  set,  i.e.  video  data,  set  of  images,  hyperspectral  data, 
medical  data,  text  data,  etc.  Moreover,  the  framework  provides  a  way  to  analyze  data 
whose  different  clusters  are  not  linearly  separable. 

The  method  we  formulate  is  semi-supervised,  meaning  that  a  set  of  known  label¬ 
ings,  called  the  fidelity  set,  is  to  be  provided  a  priori.  One  advantage  of  our  proposed 
algorithm  is  that,  even  with  a  very  small  fidelity  set  consisting  of  labeled  points,  usually 
only  about  2.5%-10%  of  the  data,  it  is  able  to  obtain  an  accurate  classification.  In 
comparison,  supervised  methods  usually  use  about  60%  and  40%,  or  70%  and  30%,  of 
the  data  for  training  and  testing  sets,  respectively. 

1.1.  Previous  Work 

The  problem  of  classification  involves  exploiting  the  similarities  and  the  structure 
of  the  data  to  properly  cluster  the  set  into  several  groups.  Our  procedure  uses  the 
graphical  framework,  involving  a  set  of  vertices  and  edges  with  weights  assigned  on  the 
edges,  as  a  simple  and  clear  way  to  obtain  the  underlying  similarities  in  a  data  set. 

The  setting  has  been  often  used  in  the  literature  for  that  purpose:  for  exam¬ 
ple,  [2,14,68,71  73]  consider  graphs  for  their  methods.  The  graphical  framework  has 
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been  especially  useful  in  image  processing  applications  [5,25,26,28,35-37,49,61].  Spe¬ 
cific  applications  include  the  image  denoising  method  of  Buades  [8]  and  that  of  El- 
moataz  [28],  which  incorporates  generalizations  of  the  graph  Laplacian  for  the  task  of 
manifold  smoothing  and  image  denoising.  In  addition  to  image  processing,  methods  in¬ 
volving  spectral  graph  theory  [17,56],  based  on  a  graphical  setting,  are  often  formulated 
to  solve  clustering  problems. 

More  recently,  the  graphical  setting  has  been  used  as  a  framework  for  some  opti¬ 
mization  tasks.  In  particular,  a  popular  general  minimization  problem  considered  for 
machine  learning  and  image  processing  is 

min  {£(«)  = -R(u)+F(u)}, 

where  R(u)  is  the  regularization  functional,  F(u)  is  the  fidelity  term  and  u  is  a  function 
defined  on  the  space  of  the  data  set  that  describes  an  appropriate  characteristic  depend¬ 
ing  on  the  problem.  The  R(u)  term  in  some  sense  smoothes  the  data,  and  the  fidelity 
term  allows  one  to  incorporate  any  known  information,  such  as  any  nodes  that  have 
a  known  class.  With  regards  to  the  regularization  term,  the  total  variation  functional 
has  been  used  very  successfully  in  many  image  processing  tasks,  as  shown  in  [13,33,69]. 
Due  to  some  unfavorable  properties  of  the  functional,  the  methods  in  [31,40,54]  use 
the  Ginzburg-Landau  functional  as  a  smooth  but  arbitrarily  close  approximation  to  the 
total  variation  semi-norm.  Our  method  is  motivated  by  these  algorithms,  in  particular, 
the  procedure  described  in  [31],  but  it  uses  heat  kernel  pagerank  to  overcome  some 
computational  hindarances  of  [31]. 

Heat  kernel  pagerank,  described  in  greater  detail  in  [18],  has  been  applied  in  several 
recent  works  to  a  number  of  tasks.  For  example,  [19]  presents  a  local  graph  partitiong 
algorithm  using  heat  kernel  pagerank,  which,  for  a  subset  S  with  conductance  h,  finds 
local  cuts  with  Cheeger  ratio  at  most  0(y/h).  In  [20],  the  authors  descibe  an  algo¬ 
rithm  solving  linear  systems  with  boundary  conditions  using  heat  kernel  pagerank.  The 
method  in  [21]  is  another  local  clustering  algorithm,  which  uses  a  novel  way  of  comput¬ 
ing  the  pagerank  very  efficiently.  An  interesting  application  to  heat  kernel  pagerank  is 
presented  in  [62],  which  outlines  a  method  for  finding  a  consensus  value  in  multi-agent 
networks.  In  addition,  [22]  considers  the  problem  of  computing  a  local  cluster  in  a 
massive  graph  in  the  distributed  setting  using  heat  kernel  pagerank. 

The  classification  algorithm  presented  in  the  paper  is  semi-supervised,  so  that  a 
small  set  of  labeled  data  points  is  provided  to  the  method  in  advance.  Such  problems 
have  been  studied  in  the  context  of  diffuse  interfaces;  for  example,  [3]  introduces  a 
procedure  for  image  processing  and  machine  learning  in  which  the  Ginzburg-Landau 
functional  with  fidelity  term  is  minimized  using  convex  splitting.  Other  semi-supervised 
learning  approaches  involving  the  functional  include  [31,32,53-55].  Semi-supervised 
approaches  derived  using  total  variation  include  [1,51].  The  advantage  of  our  method 
is  that,  in  practice,  only  a  small  percentage  of  ground  truth  needs  to  be  inputted  into 
the  algorithm  to  obtain  good  accuracy.  In  fact,  in  our  experiments,  we  mostly  use  only 
around  5%  or  less  of  the  points  as  fidelity. 

In  the  case  of  unsupervised  methods,  there  is  no  a  priori  information  about  the 
labeling  of  the  nodes  of  the  data  set.  The  method  proceeds  by  clustering  similar  nodes 
together.  To  prevent  trivial  solutions,  a  constraint  on  the  size  of  the  clusters  is  usually 
incorporated  into  the  procedure.  Two  such  constraints  that  have  become  popular  are 
the  normalized  cut  [52,67]  and  the  Cheeger  cut  [15,52,61];  each  of  them  favors  solutions 
where  the  clusters  are  of  similar  size.  The  latter  constraint  has  been  studied  recently  in 
works  such  as  [6,7,38,64]. 
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1.2.  Main  Results 

The  main  contributions  of  the  paper  are  the  following: 

•  We  introduce  a  very  efficient  new  graph-based  semi-supervised  classification 
algorithm,  called  heat  kernel  pagerank  MBO,  for  high-dimensional  data  that 
uses  heat  kernel  pagerank  for  the  main  steps.  The  running  time  of  the  method 
depends  heavily  on  how  fast  the  heat  kernel  pagerank  can  be  computed.  We 
use  two  very  efficient  approaches  for  this  task;  one  of  them  is  formulated  in 
[20-22,62].  Our  algorithm  is  especially  useful  for  large,  sparse  graphs,  and 
produces  accuracy  comparable  with  or  better  than  that  of  state-of-the-art,  even 
when  the  number  of  labeled  instances  is  very  small. 

•  We  validate  our  method  by  performing  experiments  using  benchmark  data  sets. 
We  also  provide  a  thorough  comparison  of  this  algorithm  and  [31],  one  of  the 
state-of-the-art  graph  methods,  and  provide  details  on  the  advantages  of  both. 

The  paper  is  organized  as  follows:  Section  2  provides  some  background  on  the 
graphical  framework  and  heat  kernel  pagerank  and  Section  3  presents  a  model  using  heat 
kernel  pagerank  directly  as  a  classifier.  Section  4  formulates  the  new  algorithm  as  well 
as  provides  details  on  computing  heat  kernel  pagerank,  Section  5  presents  experimental 
results  on  benchmark  data  sets,  and  Section  6  concludes  the  paper  and  also  provides  an 
analysis  on  our  method  and  that  of  [31]. 

2.  Background 

2.1.  Graph  Framework 

In  this  paper,  we  consider  a  graphical  framework  with  an  undirected  graph  G  = 
(V,E),  where  V  is  the  set  of  vertices  and  E  is  the  set  of  edges.  A  weight  function  is 
defined  on  each  edge,  and  represents  a  measure  of  similarity  between  the  vertices  it  is 
connecting.  As  usual,  the  degree  of  a  vertex  x  is  formulated  as 

d{x)  =  ^2w(x,y). 
yev 

Denote  by  D  the  diagonal  matrix  containing  the  degrees  of  vertices  as  diagonal  entries, 
and  let  W  denote  the  similarity  matrix  containing  the  weight  function  values.  Here 
are  some  common  mathematical  operators  on  graphs  (see  [3,36]  for  more  information). 
Let  the  set  of  vertices  V  and  edges  E  be  Hilbert  spaces  defined  via  the  following  inner 
products  and  norms: 


{u,j)v  =  ^2u(x)j(x)d(x)r , 

X 


{^,4>)s 


-J2ip(x,y)</)(x,y)w(x,y)2q  \ 
x,y 


,  =  y/(u,u)v  =  A \^,u(x)2d(x) 


UWs  =  V(Ms  =  J^'52<t>(x,y)2w(x,y)2q-1 
V  x>y 
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M£,oo=max\<f>(x,y)l 

xiV 

for  some  rG  [0,1],  gG  [|,1],  u,  7G  V,  and  1 p,<f>£E.  The  gradient  operator  is  defined  as 
(Vu)w(x,y)=w(x,y)1~q(u(y)-u(x)), 

and  the  divergence  operator  can  be  formulated  as  the  adjoint  of  the  gradient 
(divti,^)(a:)=  1  'Y^w(x,y)q((t>(x1y)-(t>(y,x)), 

V  )  y 

where  the  following  adjoint  equation  is  used:  (S7u,4>)e  =  —  (u,divw<j>)v.  Using  the  gradi¬ 
ent  and  divergence  operators,  one  can  define  a  family  of  graph  Laplacians  Ar  =  clivu,  V  : 
V  — t-  V: 

(A wu)(x)=J2Wl^y  (u(y)-u(x)). 
y  ^ 

We  also  formulate  the  Dirichlet  energy  and  total  variation: 

\  \\Vu\\2£  =  ^^2w{x,y)(u(x)-u(y))2, 
x,y 

TVw{u)  =  ma,x{{divw(j),u}v-.(j)G£1\\(p\\£  oo  <  1}  = -^w{x,y)q\u{x)-u{y)\. 

x.y 

If  u  is  viewed  as  a  vector  in  Rm,  we  can  write 

-Awu=(D1-r-D~rW)u. 

The  expression  D1^r  —  D~rW  forms  a  general  expression  for  the  graph  Laplacian,  which 
is  the  graph  version  of  the  Laplace  operator.  The  parameter  r  allows  the  user  to  control 
the  type  of  Laplacian.  For  example,  the  cases  of  r  =  0  and  r  =  1  result  in  the  unnor¬ 
malized  Laplacian  L  =  D  —  W  and  the  random  walk  Laplacian  Lw  =D~1L ,  respectively. 
Another  popular  version  of  the  Laplacian  is  the  symmetric  Laplacian  Ls  =  D~  5  LD~  2 . 
These  graph  Laplacians  have  the  following  easily  shown  properties: 

1)  L  and  Ls  are  symmetric. 

2)  L,  Ls  and  Lw  are  positive  semi-definite. 

3)  L,  Ls  and  Lw  have  non- negative,  real- valued  eigenvalues  0  =  Ai  <  A2  <  ...  <  A„. 

4)  A  is  an  eigenvalue  of  Lw  with  eigenvector  u  if  and  only  if  A  is  an  eigenvalue  of  Ls 
with  eigenvector  w  =  D?u. 

With  regards  to  the  weight  function  w,  the  goal  is  to  choose  w  so  that  similar 
vertices  are  weighted  by  a  large  value  and  dissimilar  vertices  are  weighted  by  a  small 
value.  While  many  versions  of  weight  functions  exist,  two  popular  ones  are  the  Gaussian 
weight  function 

d(x,y)^ 

w(x,y)  =  e  G2 

and  the  Zelnik-Manor  and  Perona  weight  function  [59],  which  gives  a  local  rescaling, 
and  is  written  as 

_  d(j:.y)2 

w(x,y)  =  e  - 
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where  d(x,y)  is  a  metric  computing  the  distance,  in  some  sense,  between  x  and  y ,  and 
a  is  a  parameter  to  be  chosen.  Also,  \Jt(x)  =d(x,z)  and  2  is  the  Mth  closest  vertex  to 
x,  where  M  is  a  variable  to  be  chosen. 

The  metric  d(x,y)  will  depend  on  the  data  set.  For  example,  in  the  case  of  points  in 
R2 ,  one  might  consider  it  to  be  the  usual  Euclidian  distance,  since  points  far  apart  are 
more  likely  to  be  in  separate  clusters  than  those  closer  together.  When  working  with 
images,  one  might  consider  comparing  neighborhoods  around  pixels  x  and  y  using  the 
L 2  or  L 1  norm.  Of  course,  techniques  like  rotation  invariance  can  also  be  incorporated. 
For  text  classification,  one  can  use  tfidf  (term  frequency  inverse  document  frequency) 
to  form  feature  vectors  for  each  document  and  then  use  cosine  similarity  to  compare 
the  vectors.  In  the  case  of  hyperspectral  data,  one  way  to  formulate  the  metric  d(x,y) 
would  be  to  incorporate  the  hundreds  of  channels  of  each  node. 

2.2.  Heat  Kernel  PageRank  and  Personalized  PageRank 

The  heat  equation,  formulated  as 


du 

dt 


=  A  u, 


occurs  very  frequently  in  many  different  kinds  of  applications.  For  example,  in  financial 
mathematics,  it  is  used  to  solve  the  Black-Scholes  partial  differential  equation.  A  more 
general  version  of  it,  called  the  diffusion  equation,  occurs  frequently  in  chemical  diffusion 
steps.  In  this  section,  we  show  some  ways  of  solving  the  heat  equation  in  a  graphical 
setting: 


du 

dt 


=  — Lu , 


(2.1) 


where  the  Laplace  operator  A  is  replaced  by  the  graph  Laplacian.  Different  Laplacians, 
such  as  Ls  or  Lw,  can  be  used. 

The  heat  kernel  of  a  graph  G  is  the  matrix  e~tL,  which  is  a  solution  to  the  heat 
equation  (2.1).  If  the  symmetric  graph  Laplacian  is  used,  the  heat  kernel  is  T~Lt  =e~tLs. 
This  version  is  of  interest  due  to  the  fact  that  Ls  is  a  symmetric  matrix.  If  the  random 
walk  Laplacian  LW  =  I  —  D~lW  =  I  —  P  is  considered,  the  resulting  heat  kernel  is  Ht  = 
Note  that  P  =  D_1W  is  just  a  transition  probability  matrix  for  the  random 
walk  in  which  one  moves  from  vertex  i  to  vertex  j  with  probability  given  by 
The  heat  kernel  pagerank  is  defined  as  ptj  =  fHt  for  row  vector  /: 


Pt,f  = 


/^=Ee_Vpfc- 


k—0 


(2.2) 


From  (2.2),  one  can  see  that,  in  the  case  that  /  is  a  probability  distribution,  the  heat 
kernel  pagerank  is  just  the  expected  distribution  of  the  random  walk  with  probability 
matrix  P,  where  k  steps  are  taken  with  probability  and  the  starting  point  is 

chosen  from  /.  Therefore,  in  this  case,  one  can  approximate  the  heat  kernel  pagerank 
vector  by  simulating  enough  random  walks  (with  the  number  of  steps  calculated  ac¬ 
cording  to  the  Poisson  distribution,  and  with  the  starting  point  taken  from  /)  to  obtain 
a  close  approximation  to  the  expected  distribution.  This  approach  is  the  one  taken 
by  [21,62]  when  computing  heat  kernel  pagerank. 

In  [62],  Simpson  and  Chung  show  a  useful  application  of  heat  kernel  pagerank. 
Consider  the  following  heat  equation  in  a  graphical  setting: 


u(t)  =  -Lwx(t), 


u(0)eRn. 


(2.3) 
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According  to  Theorem  III. 2  in  [62],  the  solution  to  (2.3)  is  given  by 

u(i)  =  £rV*>,  f  =  u(0)trD,  (2.4) 

where  Mtr  denotes  the  transpose.  Therefore,  the  heat  equation  in  a  graphical  setting, 
defined  using  the  random  walk  Laplacian  can  be  solved  using  heat  kernel  pagerank. 
Other  applications  of  heat  kernel  pagerank  include  finding  the  consensus  value  in  multi¬ 
agent  networks  [62],  local  clustering  [19,21,22]  and  solving  local  linear  systems  with 
boundary  conditions  [20]. 

Another  version  of  pagerank  is  personalized  pagerank  pra,f  ■  It  is  defined  by  the 
following  recursion  formula: 


Vraj  =  otf  +  {l-ot)praJP ,  (2.5) 

where  P  is  a  transition  probability  matrix,  /  is  a  probability  distribution  and  a  is 
a  scalar  between  0  and  1.  The  pagerank  describes  a  surfer  model  in  which  a  web 
surfer,  with  probability  a,  jumps  to  a  page  chosen  from  a  probability  distribution  /, 
but  with  probability  1  —  a,  moves  according  to  the  transition  probability  matrix.  It  also 
represents  the  probability  distribution  of  the  random  walk  arriving  at  each  node.  An 
equivalent  definition  of  the  pagerank  is: 

OO 

WaJ=Y,u{1-*)kfpk-  (2-6) 

k- 0 

Comparing  the  two  different  pageranks  defined  by  (2.6)  and  (2.2),  we  see  that  the  former 
is  a  geometric  sum  and  the  latter  is  an  exponential  sum.  Very  often,  the  exponential 
sum  converges  more  quickly. 

3.  Heat  Kernel  Pagerank  as  a  Classifier 

In  [50],  Lin  and  Cohen  introduce  a  simple  semi-supervised  learning  model  using 
personalized  pagerank  directly  as  a  classifier.  In  fact,  their  method  consists  of  simply 
calculating  the  personalized  pagerank  for  different  /,  and  assigning  the  class  number  to 
each  node  depending  on  which  distribution  gives  a  higher  value  of  the  pagerank  for  that 
node.  Note  that,  very  often,  the  exponential  sum  of  heat  kernel  pagerank  converges  more 
rapidly  than  the  geometric  sum  of  personalized  pagerank.  Therefore,  in  this  section, 
we  describe  a  procedure  based  directly  on  that  of  [50],  but  using  heat  kernel  pagerank 
instead  of  personalized  pagerank.  The  method  is  a  direct  use  of  heat  kernel  pagerank 
as  a  classifier. 

For  a  data  set  of  k  classes,  the  procedure  to  cluster  the  set  into  k  parts  is: 

•  Calculate  {fi,--,fk}  by  setting  ft  at  node  a:  to  1  if  a;  is  a  labeled  node  assigned 
to  class  i.  Otherwise,  set  it  to  0. 

•  Normalize  to  have  the  L 1  norm  of  1. 

•  Calculate  k  different  heat  kernel  pageranks  ptji  for  i  =  l:k. 

•  The  class  of  node  x  is  argmax^pt  j^x)}. 

We  will  refer  to  this  algorithm  as  HKPR  in  Section  5,  where  we  describe  the 
results  of  this  method  and  compare  it  to  the  model  proposed  in  this  paper.  This  simple 
procedure,  which  uses  heat  kernel  pagerank  directly  as  a  classifier,  performs  well  enough; 
in  most  cases,  the  accuracy  is  only  about  1%  —  2%  worse  than  that  of  our  algorithm. 

4.  Semi-Supervised  Heat  Kernel  Pagerank  MBO  Algorithm 


A  semi-supervised  heat  kernel  pagerank  MBO  algorithm  for  data  classification 


4.1.  Motivation 

To  formulate  our  semi-supervised  classification  method,  we  use  a  variational  ap¬ 
proach.  Recall  the  general  minimization  problem  for  data  classification  mentioned  in 
Section  1: 

min  {-E(u)  =  .R(u) +  *■(«)}, 

where  R(u )  is  the  regularization  functional,  F(u)  is  the  fidelity  term  and  u  is  a  function 
defined  on  the  space  of  the  data  set.  Since  the  problem  we  are  solving  is  classification, 
in  this  paper,  u(x)  represents,  in  some  sense,  the  class  representation  of  node  x. 

Regarding  the  R(u)  term,  the  total  variation  functional 

TV  (u)  =  J \Vu\dx 

has  been  used  very  successfully  in  image  processing  applications  as  a  regularization 
term  [33].  In  addition,  many  papers  use  total  variation  to  approximate  the  length  of 
the  boundary  of  the  binary  segmentation  function,  often  referred  to  as  perimeter. 

Let  us  consider  another  related  functional,  called  the  Ginzburg-Landau  (GL)  func¬ 
tional,  originally  proposed  to  describe  physical  phenomena  such  as  liquid-gas  transitions 
and  superconductivity.  It  can  be  formulated  as 

GL(u)  =  —  J \\7u\2dx+  -  J  W(u)dx , 

where  W(u)  is  a  double-well  potential,  such  as  W(u)  =  ^u2(u—  l)2,  and  e  is  a  positive 
constant.  It  has  been  shown  [44]  that  the  Ginzburg-Landau  functional  converges,  as 
e— >0,  to  the  perimeter,  a  quantity  that  is  also  often  approximated  by  total  variation. 
The  relationship  between  the  two  functionals  also  applies  in  the  graphical  framework, 
and  can  be  formulated  in  terms  of  gamma  convergence.  The  result  is  shown  in  [4], 
where  is  it  proven  that  for  any  r,  TVW  is  the  T-limit,  in  the  sense  of  gamma  conver¬ 
gence,  of  a  sequence  of  graph-based  Ginzburg-Landau  (GL)-type  functionals.  Therefore, 
one  can  consider  the  Ginzburg-Landau  functional  to  be  a  smooth  but  arbitarily  close 
approximation  to  total  variation. 

The  advantage  of  using  the  Ginzburg-Landau  functional  for  algorithm  derivations 
lies  in  the  fact  that  it  leads  to  a  simple  minimization  scheme,  while  the  minimization 
of  total  variation  results  in  an  unwanted  nonlinear  curvature  term.  Moreover,  the  GL 
scheme  contains  the  Laplace  operator,  which  allows  for  the  usage  of  heat  kernel  pager¬ 
ank. 

Following  there  ideas,  the  method  in  [54]  is  formulated  by  considering  the  Ginzburg- 
Landau  functional  as  a  regularization  term.  The  functional  plus  fidelity  term  is  mini¬ 
mized  using  gradient  descent,  which  results  in  the  following  equation: 

u(t)  =  —  2eLu(t) - — .  (4.1) 

Here,  note  that  the  Laplace  operator  is  now  replaced  by  the  graph  Laplacian,  since  the 
graphical  framework  is  used.  One  possibility  to  solve  (4.1),  explored  by  [54],  is  to  split 
the  dynamics  into  two  steps  of  first  propagating  by  the  heat  equation  with  an  extra 
term  (derived  from  the  fidelity  term)  and  then  propagating  by  an  equation  involving 
the  double-well  potential.  Another  possibility,  which  we  follow  in  this  paper,  is  to  split 
the  dynamics  into  three  steps: 
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1.  Propagate  by  the  heat  equation: 


u(t)  =  —Lu(t). 


2.  Propagate  by  the  equation  involving  the  fidelity  term: 


u{t) 

3.  Propagate  by  the  following  equation: 


dF 
du ' 


u{t) 


1 

e 


W{u). 


(4.2) 


(4.3) 


(4.4) 


To  obtain  an  approximate  solution  of  (4.1)  at  discrete  times,  one  would  alternate 
between  the  three  steps.  The  advantage  of  this  splitting  is  that  step  1  involves  solving 
only  the  heat  equation,  and  we  pursue  this  splitting  to  formulate  our  method.  The  heat 
equation  is  solved  using  heat  kernel  pagerank. 

Note  that,  in  the  limit  as  e— >-0,  the  last  step  becomes  thresholding;  therefore,  it  is 
equivalent  to 


u(x) 


1 

0  ifv(x)<^, 


where  v  is  the  solution  to  the  second  step  of  the  dynamics.  Due  to  the  binary  answer 
in  the  last  step,  this  procedure  is  applicable  only  to  binary  problems.  Since  we  would 
like  our  algorithm  to  be  applicable  to  any  data  set,  binary  or  not,  the  next  section  will 
adapt  the  procedure  to  an  arbitrary  amount  of  classes. 

4.2.  Algorithm 

To  formulate  the  new  algorithm,  we  first  define  some  notation.  In  the  binary  case 
of  only  two  classes,  it  makes  sense  to  define  u(x)  for  each  a:  to  be  a  real-valued  number, 
representing,  in  some  sense,  one  of  the  two  classes.  However,  in  the  multiclass  case, 
one  has  to  be  more  careful  about  defining  the  label  vector  to  avoid  any  biases  between 
the  classes.  Therefore,  in  the  case  of  m  classes  and  n  data  points,  instead  of  a  vector 
u ,  we  use  a  matrix  U  =  [u-| un],  where  every  node  is  associated  with  a  probability 
distribution  iq  (row  vector)  over  the  to  classes.  In  other  words,  the  jth  entry  of  iq 
represents  the  probability  node  i  belongs  to  class  j.  While  iq  has  dimension  1  x  m,  U 
has  dimension  nx  to.  Moreover,  the  vector  iq  is  an  element  of  the  Gibbs  simplex  Em, 
formulated  as 


Em:= 


(xi,...,Xm)  £  [0,l]m 


We  denote  the  vertices  of  the  simplex  by  e,  for  i  =  1 :  to;  the  vector  et  contains  all  zero 
values,  except  the  ith  entry,  which  is  equal  to  1.  These  vertices  correspond  to  phases 
whose  class  designation  is  known. 

Using  the  standard  Gibbs-simplex  Em,  it  is  not  hard  to  extend  the  theory  covered 
in  the  previous  section  to  an  arbitrary  amount  of  classes.  We  do  not  change  steps  (4.2) 
and  (4.3),  except  that  the  label  vector  will  now  be  a  matrix  U,  where  each  row  is  a 
probability  distribition.  The  third  step,  however,  has  to  be  modified  to  pertain  to  any 
number  of  classes,  so  we  first  project  each  row  of  U  to  the  simplex  (using  the  approach 
formulated  in  [16])  and  then  we  displace  towards  the  closest  vertex  in  the  Gibbs  simplex. 
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The  new  algorithm  thus  consists  of  the  following  steps.  First,  an  initial  U  is  chosen. 
For  fidelity  points,  we  initialize  the  row  of  U  to  be  the  corresponding  vertex  of  the 
simplex.  For  the  rest  of  the  points,  we  choose  the  probability  distributions  randomly. 
To  obtain  the  next  iterate  of  the  matrix  U,  three  steps  are  taken: 

1.  Propagation  using  the  heat  equation: 


U  =  -LU.  (4.5) 

2.  Propagation  using: 

u  =  -f(u)  (4.6) 

3.  Thresholding: 

u  in+1=eh,  (4.7) 

where  vertex  e h  is  the  vertex  in  the  simplex  closest  to  the  projection  of  the  ith 
row  of  the  result  of  step  2  to  the  simplex,  using  the  procedure  in  [16] . 

We  alternate  between  the  three  steps  until  there  is  little  difference  between  the 
current  and  the  previous  iterate,  using  the  metric: 


||U4"+1-U,;"||| 

Kn+1ll! 


<11, 


where  is  a  small  positive  constant.  The  final  classification  is  obtained  by  assigning  to 
node  i  the  class  that  holds  the  highest  probability  in  its  class  distribution  u*. 

Step  (4.5)  is  solved  using  heat  kernel  pagerank,  which  we  calculate  using  two  differ¬ 
ent  approaches,  to  be  described  in  more  detail  in  the  next  section.  To  be  more  specific, 
first  let  Cj  represent  the  ith  column  of  the  result  of  step  (4.5)  and  ci0  represent  the  ith 
column  of  U  at  the  start  of  step  (4.5).  Now,  note  that  (4.5)  consists  of  solving  a  system 
of  heat  equations  in  graph  form: 


u(t)  =  -Lwu(t), 
u(t)  =  -Lwu(t), 

u(t)  =  -Lwu(t), 


■u(O)  =c00  €  Rn 
u(0)  =ci0  €  Rn 

■u(O)  =  Cmg  €  Rn , 


(4.8) 


where  m  is  the  number  of  classes.  We  can  use  heat  kernel  pagerank  to  solve  each  of  those 
problems;  details  were  given  in  Section  2.2.  More  precisely,  it  is  known  that  column  i 
of  the  result  of  step  (4.5),  which  is  the  solution  of  the  ith  equation  above,  is  given  by 

d  =  D-xpXf,  /  =  Ci0trD. 

The  Ci  s  are  then  concatenated  to  form  the  matrix  [c-| . . .  cTO] ,  which  is  the  result  of  step 
(4.5).  Step  (4.6)  is  calculated  explicitly,  i.e. 


Un+1-Un 

dt 


d  F 
du 


(Un). 


The  heat  kernel  pagerank  MBO  algorithm  is  outlined  in  Figure  4.3.  The  version 
of  the  algorithm  that  calculates  heat  kernel  pagerank  using  series  will  be  referred  to  as 
HKPR1;  the  version  that  uses  random  walks  to  calculate  heat  kernel  pagerank  will  be 
referred  to  as  HKPR2. 
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Fig.  4.1:  Heat  Kernel  Pagerank  Approximation  (Series) 


Heat  Kernel  Pagerank  Approximation  (Series) 


Require:  a  row  vector  /,  probability  matrix  P  of  size  nxn,  £eP+,  maximum 
max  number  of  terms  to  include  in  the  series,  tolerance  tol 
*  Initialize  s  —  e~tf,  r\  =  /,  k  =  1,  c—i. 
while  m  >  tol  and  c  <  max  do 

*  r\  <—  r\P 

★  S  +  r2 

*  mi—  ||r2||oo 

★  k  i—  k+ 1,  c  i—  c+1 

end  while 
return  s 


Fig.  4.2:  Heat  Kernel  Pagerank  Approximation  (Random  Walk) 


Heat  Kernel  Pagerank  Approximation  (Random  Walk) 


Require:  a  row  vector  /,  probability  matrix  P  of  size  nxn,  tER+ ,  an  upper 
limit  K  on  length  of  a  walk,  a  number  r  of  random  walks 
Write /  =  /+-/_. 

Initialize  row  vectors  fi+  and  /i_  to  zero  row  vectors  of  length  n. 

for  i  —  1 :  r  do 

7k  Choose  a  starting  vertex  vq  using  the  probability  distribution  jjy^jy  • 

7k  Simulate  a  random  walk  starting  from  vq  using  probability  matrix  P,  where 
the  number  of  steps  k  taken  is  a  Poisson  random  variable  (with  parameter 
t),  and  k<K. 

7k  Let  Vf  be  the  last  vertex  of  the  walk. 

*  /*+(«>/)  <-/*+(«/)+ 1 

end  for 

Repeat  the  for  loop  using  probability  distribution  and  //_ . 

return 


4.3.  Computing  Heat  Kernel  Pagerank 

Step  (4.5)  of  the  new  algorithm  is  solved  using  heat  kernel  pagerank.  Recall  the 
formula  for  heat  kernel  pagerank: 

OO  4-k 

PtJ  =  fHt  =  '£e-t-fPk.  (4.9) 

k— 0 


We  calculate  the  pagerank  via  two  different  approaches:  using  series  and  using  random 
walks,  to  be  described  in  the  next  two  subsections. 

4.3.1.  Computing  Heat  Kernel  Pagerank  Using  Series 

The  first  approach  is  to  sum  each  term  of  the  series  (4.9)  until  the  maximum  value  of 
the  last  term  is  small  enough.  In  other  words,  each  term  of  the  series  (4.9)  is  efficiently 
calculated  until 


k\ 


fPk<e, 
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Fig.  4.3:  Heat  Kernel  Pagerank  MBO  Algorithm 


Heat  Kernel  Pagerank  MBO  Algorithm  for  Classification 


Require:  a  graph  G  of  size  n ,  degree  matrix  D ,  probability  matrix  P,  matrix  U 
holding  the  probability  distributions  over  classes,  number  of  classes  m,  class 
matrix  C  (nx  1  matrix)  holding  the  class  assignments,  tER+,  tolerance  tol 
Initialize  U  to  Uq. 

*U  =  U0,Uold  =  U0. 
while  d<tol  do 

*  U0ld  U 
for  i  —  l:rri  do 

*  Let  Ci  be  the  ith  column  of  U. 

*  Calculate  the  heat  kernel  pagerank  ptj  using  P  with  the  row  vector 
4rD,  using  series  for  version  one  (HKPR1)  and  using  random  walks  for 
version  two  (HKPR2). 

*  Ci<—  D~1p^rf 
end  for 

•k  U  ^  [Cl  •  • .  Cm] 

*  Propagate  U  using 


U=H 


dF 

du 


(U) 


*  Project  each  row  of  U  to  the  simplex  using  [16]. 

*  Replace  each  row  of  U  by  the  closest  vertex  in  the  Gibbs  simplex 

d,  \\u-u°i?\\l 
Ill'll* 


end  while 
for  i  =  1 :  n  do 

C,  —argrnaxiU,  ),  where  U,  is  the  ith  row  of  U. 

end  for 
return  C 


for  some  k  and  e.  This  shortened  sum  approximates  heat  kernel  pagerank.  The  approach 
is  detailed  in  Figure  4.1.  In  practice,  we  calculate  at  most  j  terms  of  the  series,  where 
j  is  a  chosen  variable;  for  the  experiments,  we  use  up  to  12  terms  in  the  series.  We  also 
use  a  sparse  matrix  P,  which  makes  matrix  computations  much  less  costly. 

4.3.2.  Computing  Heat  Kernel  Pagerank  Using  Random  Walks 

To  give  the  motivation  for  the  second  approach,  we  first  assume  that  /  is  a  proba¬ 
bility  distribution.  In  this  case,  fPk  is  the  distribution  on  the  vertices  after  k  random 
walk  steps  (with  probability  matrix  P),  starting  with  a  vertex  chosen  from  the  proba¬ 
bility  distribution  /.  Therefore,  heat  kernel  pagerank  can  be  considered  as  the  expected 

—  t.k 

distribution  of  the  random  walk,  if  k  steps  are  taken  with  probability  e  fc!  ,  with  the 
starting  vertex  chosen  from  /. 

This  motivates  a  random  walk  procedure  to  calculate  heat  kernel  pagerank.  We  first 
calculate  r  random  walks,  where  each  walk  proceeds  using  the  probability  transition 
matrix  P  for  exactly  k  steps,  where  k  is  a  Poisson  random  variable,  starting  with  a 
vertex  chosen  from  /.  However,  since  long  walks  occur  with  low  probability,  we  limit 
the  random  walk  length  to  K  steps.  Then,  as  mentioned  before,  the  distribution  over 
the  final  vertices  of  the  r  random  walks  can  approximate  heat  kernel  pagerank. 

Since  /  might  not  be  a  probability  distribution,  we  first  decompose  it  into  positive 
and  negative  portions,  /+  and  /_,  and  then  normalize  each  part  to  sum  to  one.  In  other 
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words,  note  that 


PtJ  =  fHt  =  (/+  ■ -  f-)Ht  =  1 1/+ 1 1  {  J±yHt }  -  |  |/_ 1 1  {  j^Ht },  (4.10) 

where,  of  course,  and  are  probability  distributions.  The  random  walk  proce¬ 
dure  is  applied  to  both  of  these  probability  distributions,  and  the  pagerank  is  obtained 
using  the  linear  combination  in  (4.10).  The  details  of  this  approach  are  summarized  in 
Figure  4.2.  For  the  experiments,  we  calculate  r  and  K  using  the  formulas: 


r  = 


logic  *) 

loglogie-1)' 


where  we  use  e  =  0.075. 

5.  Experimental  results 

In  this  section,  we  validate  our  method  by  presenting  results  for  experiments  on 
benchmark  data  sets.  The  accuracy  of  our  algorithm  is  comparable  with  or  better  than 
that  of  state-of-the-art  methods.  Even  when  heat  kernel  pagerank  is  used  directly  as 
a  classifier  in  a  very  simple  procedure  (HKPR  method  described  in  Section  3),  the 
performance  is  still  good. 

5.1.  Two  Moons 

This  data  set  is  constructed  from  two  half  circles  in  M2  with  a  radius  of  one.  The 
centers  of  the  two  half  circles  are  at  (0,0)  and  (1,0.5).  A  thousand  uniformly  chosen 
points  are  sampled  from  each  circle,  embedded  in  IR100  and  i.d.cl.  Gaussian  noise  with 
standard  deviation  0.02  is  added  to  each  coordinate.  Therefore,  the  set  consists  of  two 
thousand  points.  Starting  from  some  initial  classification  of  the  points,  the  goal  is  to 
segment  the  two  half  circles.  The  graph  is  constructed  using  the  100-dinrensional  points 
as  feature  vectors,  and  is  sparsified  based  on  10  nearest  neighbors  with  local  scaling 
using  the  10t,!  closest  neighbor.  We  use  fidelity  sets  of  50  out  of  2000  points. 

We  performed  100  different  experiments,  each  with  a  different  fidelity  set.  For  the 
first  version  of  the  new  algorithm  (HKPR1),  we  obtained  an  average  accuracy  of  96.26% 
over  100  runs,  with  each  taking  just  over  a  second  on  average.  The  average  number  of 
iterations  was  3.29.  For  the  second  version  of  the  new  algorithm  (HKPR2),  we  obtained 
an  average  accuracy  of  96.02%  over  100  runs,  each  taking  around  17  seconds.  The 
average  number  of  iterations  was  around  8.29.  When  heat  kernel  pagerank  is  used 
directly  as  a  classifier  (HKPR  algorithm),  the  average  accuracy  was  around  95.73%. 

The  results  of  the  proposed  method  are  shown  in  Table  5.1,  where  our  algorithm 
(two  versions)  is  written  in  italics.  We  also  provide  those  of  other  related  methods 
for  comparison.  While  a  few  of  the  mentioned  methods  are  unsupervised,  they  all 
incorporate  some  sort  of  prior  knowledge,  such  as  a  mass  constraint,  so  we  view  them 
as  comparable  to  our  semi-supervised  approach. 

5.2.  MNIST 

The  MNIST  data  set  [48]  is  composed  of  70,000  28  x  28  images  of  handwritten  digits 
0  through  9.  Examples  of  some  of  the  images  are  shown  in  Figure  5.1.  The  task  is  to 
correctly  assign  each  image  to  a  digit;  this  is  a  10  class  segmentation  problem.  The 
feature  vector  of  each  node  consists  of  784  pixel  intensity  values  present  in  each  image 
of  the  data  set.  To  construct  the  weight  matrix,  we  use  a  graph  sparsified  using  8 
nearest  neighbors  with  local  scaling  based  on  the  8th  closest  neighbor.  Note  that  we  do 
not  perform  any  preprocessing,  i.e.  the  graph  is  constructed  from  the  raw  data  directly. 
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For  the  fidelity  term,  250  images  per  class  (2500  images  corresponding  to  3.6%  of  the 
data)  are  chosen  randomly. 

First,  we  tested  the  full  MNIST  dataset,  consisting  of  70000  images  of  10  digits, 
performing  many  experiments  over  different  fidelity  sets.  For  the  first  version  of  the 
algorithm,  we  obtained  an  average  accuracy  of  97.52%  over  100  runs,  each  taking  just 
22.25  seconds  on  average.  The  average  number  of  iterations  was  8.69.  For  second 
version,  we  obtained  an  average  accuracy  of  97.48%  over  20  different  fidelity  sets  using 
only  around  3.6%  of  the  points.  The  procedure  took  around  1.23  hours  to  run  on 
average.  The  average  number  of  iterations  for  this  method  was  16.  Using  heat  kernel 
pagerank  directly  as  a  classifier  (HKPR  method)  gives  an  accuracy  of  around  96.43%. 

We  also  tested  a  binary  subset  of  MNIST,  consisting  of  only  digits  4  and  9,  resulting 
in  a  data  set  of  13,782  nodes.  This  data  set  was  chosen  because  these  digits  are  hard 
to  distinguish  when  handwritten.  The  graph  and  the  fidelity  set  was  constructed  in  the 
same  way  as  for  the  70K  MNIST  data  set.  We  performed  experiments  with  different 
fidelity  sets.  For  HKPR1,  we  obtained  an  average  accuracy  of  98.28%  over  100  runs, 
each  taking  just  around  1  second  on  average.  The  average  number  of  iterations  was 
14.78.  For  HKPR2,  we  obtained  an  average  accuracy  of  around  98.37%  over  20  runs. 
The  procedure  took  around  8.38  minutes  to  run,  taking  around  34  iterations  on  average. 
The  HKPR  method’s  average  accuracy  was  around  94.94%. 

The  results  of  both  MNIST  experiments  are  shown  in  Table  5.1,  in  addition  to 
those  of  some  of  the  best  methods.  Most  of  the  algorithms  we  compare  with,  such  as 
linear  and  nonlinear  classifiers,  boosted  stumps,  support  vector  machines,  neural  and 
convolution  nets,  are  all  supervised  learning  approaches,  using  60,000  of  the  images 
for  a  training  set  and  10,000  images  for  a  testing  set.  From  the  table,  one  can  see 
that  our  method  performs  very  competitively  with  these  procedures,  and  in  most  cases 
outperforms  them,  even  with  only  3.6%  of  the  nodes  as  fidelity.  Moreover,  we  note 
that,  unlike  the  case  with  some  of  the  algorithms  we  compare  with,  no  preprocessing  or 
feature  extraction  was  performed  on  the  raw  data  in  our  experiments. 


0  1  Z-3  4/ 


Fig.  5.1:  Examples  of  digits  from  the  MNIST  data  base 


5.3.  LFR  Network 

In  their  paper  [46],  the  authors  introduce  a  class  of  graphs  that  are  built  with 
heterogeneous  distributions  of  class  size  and  node  degree,  since  many  real  networks  have 
this  property.  The  degrees  and  class  sizes  are  assigned  from  a  power-law  distribution 
with  power  £  and  /?,  respectively.  The  sum  of  the  class  sizes  must  equal  the  number 
of  nodes  n  in  the  graph.  In  addition,  this  class  of  graphs  has  a  mixing  parameter  /i 
associated  to  them,  which  is  the  fraction  of  edges  each  node  shares  with  other  classes. 
The  maximum  degree  drnax ,  mean  degree  dav ,  largest  class  size  smax ,  smallest  class  size 
Smim  AO  P  are  parameters  that  are  entered  by  the  user.  The  code  to  build  these  data 
sets  is  provided  by  the  authors  of  [46]. 

For  our  experiments,  we  construct  LFR  graphs  with  1000  nodes,  which  we  call 
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LFRl,  and  with  50,000  nodes,  which  we  call  LFR50.  To  construct  the  LFR1  network, 
we  set  the  parameters  dmax,  dav,  smax,  smin,  £,  /?  to  50,  20,  50,  10,  2,  1,  respectively. 
We  vary  the  fi  parameter  from  0.1  to  0.8  in  0.1  increments,  since  it  allows  us  to  control 
the  level  of  difficulty.  A  higher  fi  would  make  it  more  difficult  to  classify  each  node. 
The  number  of  classes  in  each  LFRl  network  was  about  40. 

In  addition,  in  order  to  evaluate  our  algorithm  on  larger  networks,  we  construct 
LFR50  networks,  each  with  50,000  nodes.  To  construct  the  network,  we  first  con¬ 
sider  50  LFRl  networks  Li,...L5o,  each  with  a  mixing  parameter  and  then  connect 
each  node  in  Ls  to  20 /i  nodes  in  Ls+ 1,  chosen  uniformly  at  random,  and  we  note  that 
T51  =L\.  Since  each  LFRl  network  contains  around  40  classes,  by  construction,  each 
LFR50  graph  consists  of  around  2000  classes.  As  in  the  LFRl  case,  we  vary  the  mixing 
parameter  fi  of  the  LFRl  networks  used  to  construct  LFR50,  from  0.1  to  0.8  in  0.1 
increments.  The  mixing  parameter  of  the  resulting  network  is  approximately  j+~-  In 
addition,  because  of  the  way  the  network  is  constructed,  the  graph  has  very  similar 
structure  to  an  LFRl  graph.  Moreover,  the  strength  of  each  node  in  the  network  is 
around  l  +  2/i  times  that  of  it  in  the  LFRl  network,  and  the  total  number  of  edges  in 
the  network  is  scaled  by  around  50(1 +  2/z). 

We  first  evaluate  our  algorithm  on  LFRl  networks,  each  with  a  thousand  nodes. 
For  HKPR1,  our  accuracy  depends  on  the  mixing  parameter  and  the  size  of  the  fidelity 
set.  The  procedure  took  around  1  second.  The  results  are  shown  in  Table  5.1,  where 
we  display  the  results  for  mixing  parameters  0.1,  0.2,  . . . ,  0.8  and  for  different  sizes  of 
fidelity.  The  accuracies  represent  average  accuracies  obtained  over  100  fidelity  sets. 

We  then  evaluate  the  method  on  the  LFR50  networks.  For  HKPR1,  our  accuracy 
again  depends  on  the  mixing  parameter  and  the  size  of  the  fidelity  set.  The  procedures 
took  around  19  minutes.  The  results  are  shown  in  Table  5.1,  where  we  display  the 
results  for  mixing  parameters  0.1,  0.2,  ...,  0.8  and  for  different  sizes  of  fidelity.  The 
accuracies  represent  average  accuracies  obtained  over  100  different  fidelity  sets. 

5.4.  COIL  We  evaluate  our  performance  on  the  benchmark  COIL  data 

set  [14,57]  from  the  Columbia  University  Image  Library.  This  is  a  set  of  color  128  x  128 
images  of  100  objects,  taken  at  different  angles.  The  red  channel  of  each  image  is  then 
downsampled  to  16  x  16  pixels  by  averaging  over  blocks  of  8  x  8  pixels.  Then,  24  of 
the  objects  are  randomly  selected  and  then  partitioned  into  six  classes.  Discarding  38 
images  from  each  class  leaves  250  per  class,  giving  a  data  set  of  1500  data  points.  To 
construct  the  weight  matrix,  we  used  4  nearest  neighbors  with  local  scaling  based  on 
the  4th  closest  neighbor.  The  fidelity  term  is  constructed  by  labeling  10%  of  the  points, 
selected  at  random. 

We  conduct  experiments  over  100  different  fidelity  sets.  For  HKPR1,  we  obtained 
an  average  accuracy  of  91.09%  over  100  runs,  taking  just  around  1  second  on  average. 
The  average  number  of  iterations  was  8.26.  For  HKPR2,  the  average  accuracy  was 
around  91.23%  over  100  runs.  The  average  number  of  iterations  was  around  16,  and  the 
computation  took  around  92  seconds  on  average.  When  heat  kernel  pagerank  is  used 
directly  as  a  classifier  (HKPR  procedure),  the  average  accuracy  is  92.12%. 

The  results  of  the  method  are  shown  in  Table  5.1.  We  also  include  those  of  related 
methods  (those  that  also  use  10%  of  points  for  fidelity);  one  can  see  that  our  results  are 
very  comparable  to  them  or  of  greater  accuracy. 

6.  Conclusion 

The  beginning  of  this  section  will  provide  a  comparison  of  the  proposed  method 
and  [31]  and  show  advantages  of  both.  In  addition,  we  present  a  timing  comparison  of 
the  algorithms  using  benchmark  data  sets  in  Table  5.2. 
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Table  5.1:  Results  for  benchmark  data  sets:  Moons,  MNIST,  LFR  and  COIL 


MNIST  (10  classes) 


Method 

Accuracy 

p-Laplacian  [9] 

87.1% 

multicut  normalized  1-cut  [39] 

87.64% 

linear  classifiers  [47,48] 

88% 

Cheeger  cuts  [64] 

88.2% 

boosted  stumps  [42,48] 

92.3-98.74% 

transductive  classification  [65] 

92.6% 

tree  GL  [30] 

93.0% 

Ai-nearest  neighbors  [47,48] 

95.0-97.17% 

neural/convolutional  nets  [24,47,48] 

95.3-99.65% 

nonlinear  classifiers  [47,48] 

96.4-96.7% 

SVM  [27,47] 

98.6-99.32% 

GL  [31] 

96.8% 

MBO  [31] 

96.91% 

HKPR  (Section  3) 

96.43% 

HKPR1  MBO 

97.52% 

HKPR2  MBO 

97.48% 

Method 

Accuracy 

spectral  clustering  [30] 

80% 

p-Laplacian  [9] 

94% 

Cheeger  cuts  [64] 

95.4% 

binary  GL  [3] 

97.7% 

GL  [31] 

98.41% 

MBO  [31] 

98.31% 

max-flow  [51] 

97.10% 

augmented  Lagrangian  [51] 

97.07% 

HKPR  (Section  3) 

95.73%, 

HKPR1  MBO 

96.26%, 

HKPR2  MBO 

96.02% 

COIL 


Method 

Accuracy 

^-nearest  neighbors  [63] 

83.5% 

LapRLS  [2, 63] 

87.8% 

sGT  [41,63] 

89.9% 

SQ-Loss-I  [63] 

90.9% 

MP  [63] 

91.1% 

GL  [31] 

91.2% 

MBO  [31] 

91.46% 

HKPR  (Section  3) 

92.12% 

HKPR1  MBO 

91.09% 

HKPR2  MBO 

91.23% 

MNIST  (2  classes) 


Method 

Accuracy 

GL  [31] 

98.37% 

MBO  [31] 

98.29% 

max-flow  [51] 

98.48% 

augmented  Lagrangian  [51] 

98.44% 

HKPR  (Section  3) 

94.94% 

HKPR1  MBO 

98.28% 

HKPR2  MBO 

98.37% 

LFR1 


%  of  points  belonging  to  fidelity  / 
Mixing  parameter 

4% 

8% 

12% 

16% 

20% 

0.1 

99.38% 

100% 

100% 

100% 

100% 

0.2 

97.62% 

100% 

100% 

100% 

100% 

0.3 

86.20% 

99.08% 

100% 

100% 

100% 

0.4 

64.13% 

99.08% 

99.96% 

100% 

100% 

0.5 

40.50% 

80.70% 

98.36% 

99.54% 

99.95% 

0.6 

29.36% 

58.34% 

77.84% 

85.77% 

91.70% 

0.7 

21.14% 

35.08% 

45.92% 

56.52% 

65.10% 

0.8 

13.6% 

23.21% 

29.80% 

37.64% 

43.00% 

LFR50 


%  of  points  belonging  to  fidelity  / 
Mixing  parameter 

4% 

8% 

12% 

16% 

20% 

0.1 

100% 

100% 

100% 

100% 

100% 

0.2 

100% 

100% 

100% 

100% 

100% 

0.3 

99.95% 

99.99% 

99.99% 

99.99% 

99.99% 

0.4 

96.87% 

99.31% 

99.55% 

99.70% 

99.81% 

0.5 

70.38% 

93.57% 

97.83% 

99.10% 

99.37% 

0.6 

39.84% 

60.14% 

75.80% 

81.99% 

89.71% 

0.7 

26.26% 

44.75% 

53.69% 

60.20% 

65.56% 

0.8 

16.68% 

28.14% 

35.58% 

41.76% 

46.63% 
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Table  5.2:  Timing  Comparison  (in  seconds) 


data  set/method 

2  moons 

COIL 

MNIST 
(2  digits) 

MNIST 
(10  digits) 

computation  of  eigenvectors 
for  MBO  [31] 

0.55 

0.19 

45.43 

1683.57 

our  method’s  minimization  a 

1.708 

1.461 

1.374 

24.257 

MBO  [31]  minimization  b 

2.30 

1.205 

0.52 

15.4 

“This  is  the  timing  of  the  method  using  already  computed  weights. 
''This  is  the  timing  of  the  method  using  already  computed  weights 
and  eigenvalues/eigenvectors  of  the  Laplacian. 


There  are  many  advantages  of  the  heat  kernel  pagerank  MBO  method.  First  of  all, 
computationally,  it  is  very  efficient,  as  the  main  part  of  the  algorithm  consists  of  com¬ 
puting  heat  kernel  pagerank,  which  can  be  calculated  quickly  using  the  series  approach 
or  the  procedure  in  [21,62].  Another  advantage  of  the  algorithm  to  be  mentioned  is 
that,  unlike  in  the  case  of  [31],  the  eigenvalues  and  the  eigenvectors  of  the  graph  Lapla¬ 
cian  do  not  need  to  be  calculated.  Computing  the  eigendecomposition  might  become 
computationally  expensive  for  large  n,  and  this  is  avoided  in  our  proposed  algorithm. 

The  MBO  method  [31]  has  the  advantage  of  providing  an  efficient  approach  in  the 
case  of  dense  graphs,  partly  due  to  the  Nystrom  extension  procedure  [29]  it  incorpo¬ 
rates.  The  latter  algorithm  computes  an  approximation  to  the  eigendecomposition  of 
the  graph  Laplacian  of  a  dense  graph  very  quickly  using  a  randomly  chosen  subset  of 
the  nodes.  The  eigenvectors  and  eigenvalues  are  then  used  to  proceed  with  the  MBO 
method;  in  fact,  the  graph  weights  themselves  are  not  needed  for  computations  after 
the  eigendecomposition  has  been  calculated.  Another  advantage  of  [31]  is  the  fact  that, 
even  in  the  case  of  a  very  large  amount  of  classes,  the  procedure  is  extremely  efficient, 
due  to  the  spectral  approach  utilized  at  the  main  step.  In  addition,  the  method  is 
very  simple,  requiring  one  to  alternate  between  two  uncomplicated  steps  until  a  certain 
stopping  criterion  is  satisfied. 

In  summary,  we  have  introduced  a  new  efficient  semi-supervised  algorithm  for  clas¬ 
sification,  with  results  that  are  competitive  with  or  better  than  those  of  state-of-the-art 
procedures,  and  accuracy  that  is  high  even  when  the  number  of  labeled  points  is  low. 
This  method,  motivated  by  the  approaches  of  [31],  introduces  the  usage  of  heat  kernel 
pagerank  for  its  main  steps.  Overall,  this  procedure  is  especially  useful  for  large,  sparse 
graphs  and  forms  a  powerful  approach  to  the  classification  problem. 

Acknowledgements.  We  would  like  to  thank  Zach  Boyd  for  helpful  suggestions 
and  useful  discussions. 
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